7 research outputs found

    Translating Timing into an Architecture: The Synergy of COTSon and HLS (Domain Expertise: Designing a Computer Architecture via HLS)

    Translating a system requirement into a low-level representation (e.g., register transfer level, or RTL) is the typical goal of the design of FPGA-based systems. However, the Design Space Exploration (DSE) needed to identify the final architecture may be time-consuming, even when using high-level synthesis (HLS) tools. In this article, we illustrate our hybrid methodology, which uses a frontend for HLS so that the DSE is performed more rapidly at a higher level of abstraction, but without losing accuracy, thanks to the HP-Labs COTSon simulation infrastructure in combination with our DSE tools (MYDSE tools). In particular, the proposed methodology proved useful for arriving at an appropriate design of a whole system in a shorter time than designing everything directly in HLS. Our motivating problem was to deploy a novel execution model called Data-Flow Threads (DF-Threads) running on yet-to-be-designed hardware; for that goal, using HLS directly was premature at that point in the design cycle. Therefore, a key point of our methodology consists of defining the first prototype in our simulation framework and gradually migrating the design into the Xilinx HLS tools after validating the key performance metrics of our novel system in the simulator. To explain this workflow, we first use a simple driving example: the modelling of a two-way set-associative cache. We then explain how we generalized this methodology and describe the types of results that we were able to analyze in the AXIOM project, which helped us reduce the development time from months/weeks to days/hours.
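    As a companion to the driving example mentioned above, the following is a minimal behavioural sketch (not the authors' code) of a two-way set-associative cache, of the kind that could first be validated in a simulator and only later migrated to HLS; the set count, line size and LRU policy are illustrative assumptions.

```cpp
// Minimal sketch of a two-way set-associative cache model.
// Parameters (sets, line size) and the LRU policy are assumptions.
#include <cstdint>
#include <cstdio>

constexpr unsigned SETS      = 64;   // assumed number of sets
constexpr unsigned LINE_BITS = 6;    // assumed 64-byte lines
constexpr unsigned WAYS      = 2;    // two-way set-associative

struct Line {
    bool     valid = false;
    uint64_t tag   = 0;
    bool     lru   = false;          // true if this way is least recently used
};

struct TwoWayCache {
    Line sets[SETS][WAYS];

    // Returns true on a hit; on a miss, fills the LRU (or invalid) way.
    bool access(uint64_t addr) {
        uint64_t line  = addr >> LINE_BITS;
        unsigned index = line % SETS;
        uint64_t tag   = line / SETS;

        for (unsigned w = 0; w < WAYS; ++w) {
            Line &l = sets[index][w];
            if (l.valid && l.tag == tag) {          // hit: update LRU bits
                l.lru = false;
                sets[index][1 - w].lru = true;
                return true;
            }
        }
        // Miss: victimize the LRU way (or the first invalid one).
        unsigned victim = (sets[index][0].lru || !sets[index][0].valid) ? 0 : 1;
        Line &v = sets[index][victim];
        v.valid = true; v.tag = tag; v.lru = false;
        sets[index][1 - victim].lru = true;
        return false;
    }
};

int main() {
    TwoWayCache cache;
    uint64_t trace[] = {0x1000, 0x1040, 0x1000, 0x9000, 0x11000, 0x1000};
    for (uint64_t a : trace)
        std::printf("addr 0x%llx -> %s\n", (unsigned long long)a,
                    cache.access(a) ? "hit" : "miss");
}
```

    In a simulator-first workflow such as the one described, a model at roughly this level of abstraction is enough to extract hit/miss statistics and early timing estimates before committing to an RTL-oriented HLS description.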

    From COTSon to HLS: translating timing into an architecture

    Nowadays, the increasing number of cores benefits many workloads, but programming limitations still prevent full performance from being exploited. A data-flow execution model is capable of taking advantage of the full parallelism offered by multicore systems. In such a model, the execution can be decomposed into fine-grain threads named Data-Flow Threads (DF-Threads), each of which executes only when its inputs are available. Execution overhead and power consumption are lowered thanks to the reduction of data push-pull traffic, as well as of the burden of thread management.
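    The firing rule summarized in this abstract (a thread runs only once all of its inputs have arrived) can be pictured with a few lines of code; the structure below is an assumption-based illustration rather than the DF-Threads API, and the names (`DFThread`, `synchronization_count`, `write_input`) are hypothetical.

```cpp
// Illustrative sketch of the data-flow firing rule: each thread carries a
// count of missing inputs and becomes runnable only when that count is zero.
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct DFThread {
    int synchronization_count;            // inputs still missing
    std::vector<long> frame;              // data frame holding the inputs
    std::function<void(const std::vector<long>&)> body;
};

std::queue<DFThread*> ready_queue;

// A producer writes one input into the consumer's frame; the last write
// makes the consumer ready to execute (no blocking, no polling).
void write_input(DFThread &t, int slot, long value) {
    t.frame[slot] = value;
    if (--t.synchronization_count == 0)
        ready_queue.push(&t);
}

int main() {
    DFThread adder{2, std::vector<long>(2),
                   [](const std::vector<long> &f) {
                       std::printf("sum = %ld\n", f[0] + f[1]);
                   }};

    write_input(adder, 0, 40);             // not ready yet
    write_input(adder, 1, 2);              // second input arrives: fire

    while (!ready_queue.empty()) {
        DFThread *t = ready_queue.front();
        ready_queue.pop();
        t->body(t->frame);
    }
}
```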

    Reconfigurable Logic Interface Architecture for CPU-FPGA Accelerators

    Programmable Systems-on-Chip (SoCs) are a flexible solution for offloading part of the computation from the CPU to the FPGA and accelerating execution. In today's ARM-based SoCs, the CPU and the FPGA are usually connected to each other through several different communication links based on the AMBA standard. This paper presents two possible designs for a reconfigurable logic interface architecture, to be employed as a high-performance interface module for programmable logic accelerators. These designs provide programmable, bidirectional data communication paths between the CPU's memory-mapped master interface and the FPGA. Our first proposed design offers up to 32 configurable registers, while the other offers up to 32 configurable FIFOs to exchange larger amounts of data. Both architectures communicate with the programmable logic accelerators through data stream channels.
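    The register-based variant could be expressed in HLS roughly as follows; this is an assumed sketch rather than the paper's design, using Xilinx Vitis/Vivado HLS headers and pragmas (`hls::stream`, `s_axilite`, `axis`) to model a 32-register bank written by the CPU over a memory-mapped slave port and forwarded to an accelerator through a stream channel.

```cpp
// HLS-style sketch (assumed design) of a CPU-to-accelerator interface:
// an AXI-Lite register bank feeding an AXI-Stream channel.
#include <ap_int.h>
#include <hls_stream.h>

constexpr int NUM_REGS = 32;

void cpu_fpga_iface(const ap_uint<32> regs[NUM_REGS],    // written by the CPU
                    ap_uint<32>       count,             // how many to forward
                    hls::stream<ap_uint<32>> &to_accel)  // stream to the accelerator
{
#pragma HLS INTERFACE s_axilite port=regs
#pragma HLS INTERFACE s_axilite port=count
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS INTERFACE axis      port=to_accel

    // Push the configured registers onto the stream, one word per cycle.
    for (int i = 0; i < NUM_REGS; ++i) {
#pragma HLS PIPELINE II=1
        if (i < count)
            to_accel.write(regs[i]);
    }
}
```

    The FIFO-based variant mentioned in the abstract would, presumably, replace the register bank with deeper buffers so that larger blocks of data can be exchanged per configuration.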

    An FPGA-based Scalable Hardware Scheduler for Data-Flow Models

    This paper presents a scheduler for Data-Flow threads, implemented in reconfigurable logic to be deployed on reconfigurable MPSoCs (i.e., Multi-Processing Systems-on-Chip with FPGA). Data-Flow Threads (DF-Threads) is a novel execution model for mapping threads onto local or distributed cores transparently to the programmer. This model can be parallelized massively across different cores and handles hundreds of thousands of Data-Flow threads or more, together with their associated data frames, distributing them both within a local node and, through the network, to other nodes in a transparent way. The Hardware Scheduler (HS) is designed to be used in the Programmable Logic (PL) of MPSoC FPGAs and interacts with the GPP cores, providing them with Data-Flow threads that are ready to be executed. The overall design is modeled and tested through the HP-Labs COTSon simulator. Here we use the Block Matrix Multiply benchmark to analyze the potential of the proposed model.
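    The role of the Hardware Scheduler described above (keeping the pool of ready DF-Threads and feeding the GPP cores) can be pictured with the following behavioural sketch; it is an assumption for illustration, not the COTSon model or the PL design, and the class and function names are hypothetical.

```cpp
// Behavioural sketch of a hardware scheduler: it holds DF-Threads whose
// inputs are all available and hands them to GPP cores that ask for work.
#include <cstdio>
#include <deque>

struct ReadyThread { int id; };                  // a DF-Thread ready to run

class HardwareScheduler {
    std::deque<ReadyThread> ready_pool;          // filled when sync counts hit zero
public:
    void make_ready(ReadyThread t) { ready_pool.push_back(t); }

    // A core polls the scheduler; it either receives a thread or idles.
    bool fetch_for_core(int core, ReadyThread &out) {
        if (ready_pool.empty()) return false;
        out = ready_pool.front();
        ready_pool.pop_front();
        std::printf("core %d <- DF-Thread %d\n", core, out.id);
        return true;
    }
};

int main() {
    HardwareScheduler hs;
    for (int i = 0; i < 6; ++i) hs.make_ready({i});

    // Four cores (as in a Zynq Ultrascale+ SoC) repeatedly ask for work.
    ReadyThread t;
    bool work_left = true;
    while (work_left) {
        work_left = false;
        for (int core = 0; core < 4; ++core)
            work_left |= hs.fetch_for_core(core, t);
    }
}
```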

    A Dynamic Load Balancer for a Cluster of FPGA SoCs

    To achieve maximum performance in many-core systems, it is desirable to produce more tasks than there are cores and to distribute those tasks efficiently among the available resources. Software load balancers provide sufficient performance as long as the number of jobs is large compared with the load-balancing overhead. To mitigate this overhead, delegating load balancing to an accelerator can improve the performance of such architectures. This paper presents a hardware dynamic load balancer module, implemented on the Zynq Ultrascale+ FPGA and based on semi-work-stealing scheduling. The load balancer is specifically designed for DataFlow-Threads (DF-Threads) and can support multi-core and multi-node computing architectures. The performance of the design is initially examined through a simple “stress-test” that generates threads (the Recursive-Fibonacci program) on a two-node FPGA cluster.
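    The work-stealing idea behind the load balancer can be illustrated as below; this is a software sketch of the general technique (local LIFO pops, FIFO steals from the busiest peer), not the FPGA module itself, and all names are hypothetical.

```cpp
// Sketch of work-stealing across nodes: each node keeps a local deque of
// tasks and, when it runs dry, steals from the peer with the most work.
#include <cstdio>
#include <deque>
#include <vector>

struct Task { int id; };

struct Node {
    int id;
    std::deque<Task> local;           // tasks spawned on this node
};

// Pop locally if possible; otherwise steal one task from the fullest peer.
bool next_task(Node &self, std::vector<Node> &nodes, Task &out) {
    if (!self.local.empty()) {
        out = self.local.back();      // local pops take the back (LIFO)
        self.local.pop_back();
        return true;
    }
    Node *victim = nullptr;
    for (Node &n : nodes)
        if (n.id != self.id && !n.local.empty() &&
            (!victim || n.local.size() > victim->local.size()))
            victim = &n;
    if (!victim) return false;
    out = victim->local.front();      // steals take the front (FIFO)
    victim->local.pop_front();
    std::printf("node %d stole task %d from node %d\n", self.id, out.id, victim->id);
    return true;
}

int main() {
    std::vector<Node> nodes{{0, {}}, {1, {}}};
    for (int i = 0; i < 8; ++i) nodes[0].local.push_back({i});  // imbalanced start

    Task t;
    while (next_task(nodes[1], nodes, t))                       // node 1 is idle
        std::printf("node 1 runs task %d\n", t.id);
}
```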

    Energy Efficiency Exploration on the ZYNQ Ultrascale+

    In the context of Cyber-Physical Systems (CPSs), Single-Board Computers (SBCs) can provide adaptivity for various present and future applications and permit scalability through clusters of SBCs, while possibly saving energy. In this paper, we explore the energy efficiency of a Zynq Ultrascale+ based board developed in the context of the AXIOM project. While an entire framework based on the Zynq Ultrascale+ is still in progress, the board is already available, is capable of running a full Linux OS, and allows energy consumption to be measured. We demonstrate a possible architecture based on DataFlow-Threads (DF-Threads), a novel execution model, on the Zynq Ultrascale+ platform, in order to assess the energy efficiency of DF-Threads. We measured the power consumption while RAW and RDMA message types were transceived through the board-to-board interconnects.
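    As a rough illustration of how such measurements translate into an energy figure, the sketch below integrates power samples over a transfer window; the sampling period, power values and transfer size are invented for the example and are not the AXIOM measurements.

```cpp
// Illustrative energy estimate from sampled board power:
// E = sum(P * dt), then normalized by the bytes moved.
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical power samples in watts, taken every 100 ms while RAW/RDMA
    // messages were being exchanged over the board-to-board links.
    std::vector<double> power_w = {4.8, 5.1, 5.3, 5.2, 5.0, 4.9};
    const double sample_period_s = 0.1;
    const double bytes_moved     = 64.0 * 1024 * 1024;   // assumed 64 MiB

    double energy_j = 0.0;
    for (double p : power_w)
        energy_j += p * sample_period_s;

    std::printf("energy: %.3f J, %.3f nJ/byte\n",
                energy_j, energy_j / bytes_moved * 1e9);
}
```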

    AXIOM: a scalable, efficient and reconfigurable embedded platform

    Cyber-Physical Systems (CPSs) are becoming widely used in every application that requires interaction between humans and the physical environment. People expect this interaction to happen in real time, and this puts pressure on system designs due to the ever-higher demand for processing data in the shortest and most predictable time possible. Additionally, easy programmability, energy efficiency, and modular scalability are also important for these systems to become widespread. All these requirements pose new scientific and technological challenges to the engineering community. The AXIOM project (Agile, eXtensible, fast I/O Module), presented in this paper, introduces a new hardware-software platform for CPSs, which provides an easy parallel programming model and fast connectivity in order to scale up performance by adding multiple boards. The AXIOM platform consists of a custom board based on a Xilinx Zynq Ultrascale+ ZU9EG SoC, including four 64-bit ARM cores, an Arduino socket, and four high-speed (up to 18 Gbps) connectors on USB-C receptacles. On this hardware, DF-Threads, a novel execution model based on the data-flow paradigm, has been developed and tested. In this paper, we highlight some major conclusions of the AXIOM project, such as the gain in performance compared to other parallel programming models such as OpenMPI and Cilk.